class: center, middle, inverse, title-slide

# Introduction to R for Data Analysis
## Exploratory Data Analysis
### Johannes Breuer & Stefan Jünger
### 2021-08-04

---
layout: true

---

## What is Exploratory Data Analysis (EDA)?

After wrangling our data, the next thing we should do is explore them. In practice, of course, these steps are often done iteratively.

There is no single definition of what exploratory data analysis is and which steps and methods belong to this category. For example, while descriptive (or summary) statistics are almost always part of EDA, analysis methods like correlation, chi-square tests, or principal component analysis (PCA) can be part of exploratory or confirmatory analyses (which we will cover in the next session).

---

## What is Exploratory Data Analysis (EDA)?

For this course, we specifically decided not to separate the analysis sessions into descriptive vs. inferential statistics (in parts also because, as we pointed out in the Introduction, this is not a statistics workshop), so we will also cover some inferential statistics in the EDA part.

Instead, the distinction we make between EDA and confirmatory analysis is more akin to that between exploratory and confirmatory research: While the former can be employed to develop hypotheses, the goal of the latter typically is to test hypotheses.<sup>1</sup>

.small[
.footnote[
[1] Of course, this does not mean that EDA techniques cannot be used to answer (exploratory) research questions.
]
]

---

## Exploring EDA

As stated before, exploratory data analysis can take many shapes and forms. In this session, we will focus on the following:

- summary statistics (for numeric variables)

- frequencies & proportions (for categorical variables)

- cross-tabulations & correlations

- exploring missing values and checking for outliers

*Note*: In this session, we will focus on numeric summaries and descriptions of our data. Notably, data visualization is typically also a big part of EDA.
We will cover that in the following session on data visualization with `R`.

---

## Disclaimer: A flood 🌊 of packages 📦

In the previous sessions, we have tried to focus on a small set of packages (mostly from `base R` and the `tidyverse`) and took a deep dive 🥽 exploring their functionalities.

In this session, however, we will browse through quite a few packages for EDA. Admittedly, this can be a bit overwhelming. However, there is no need to worry: You do not need to remember all of the options and their differences right away. The topic of EDA is a very suitable example for demonstrating that there often are many `R` packages for the same tasks.

---

## Choosing packages 📦 for EDA

Many of the EDA options that we are going to discuss in this session are quite similar across the different packages. However, there are differences, e.g., regarding the type of output they produce, their ease of use, and their flexibility.

In practice, you will probably not need more than 1 to 3 packages for EDA most of the time. It is, however, worth exploring the different options to find packages that (best) suit your needs. Of course, which options are best strongly depends on your goals and the output you want to get. For example: Do you only want to quickly check the data for yourself, create a report or documentation for internal use, or do you want to produce publication-ready output (tables, plots, etc.)?

---

## Data

As using the full data set can become somewhat unwieldy for the examples in this section, we will create/use a subset of the *GESIS Panel Special Survey on the Coronavirus SARS-CoV-2 Outbreak in Germany* data.
We will select a subset of variables on the following: - demographics - political orientation - risk perceptions regarding infection with and spreading of the Corona virus - personal prevention measures - trust in people and institutions with regard to dealing with the Corona virus - sources of information about the Corona virus --- ### Repetition: Data wrangling pipeline As a repetition and reminder, we will quickly go through a wrangling pipeline for these data in the following. *Note*: Of course, it is possible to do the whole wrangling in one pipe. However, to check if everything worked it is advisable to break up the pipe into smaller chunks (a nice tool for checking and debugging pipes that also provides an *RStudio* Addin is the package [`ViewPipeSteps`](https://github.com/daranzolin/ViewPipeSteps)). Also, splitting up the wrangling pipe steps allows us to show them on the slides. --- ## Wrangling pipeline: Select & rename .small[ ```r gp_covid <- read_csv2("./data/ZA5667_v1-1-0.csv") ``` ```r corona_survey <- gp_covid %>% select(id, sex:education_cat, choice_of_party, left_right = political_orientation, risk_self = hzcy001a, risk_surroundings = hzcy002a, avoid_places = hzcy006a, keep_distance = hzcy007a, wash_hands = hzcy011a, stockup_supplies = hzcy013a, reduce_contacts = hzcy014a, wear_mask = hzcy015a, trust_rki = hzcy047a, trust_government = hzcy048a, trust_chancellor = hzcy049a, trust_who = hzcy051a, trust_scientists = hzcy052a, info_nat_pub_br = hzcy084a, info_nat_np = hzcy086a, info_loc_np = hzcy089a, info_fb = hzcy090a, info_other_sm = hzcy091a) ``` ] --- ## Wrangling pipeline: Missing values If you look at the [codebook](https://dbk.gesis.org/dbksearch/download.asp?id=67378) for the *GESIS Panel Special Survey on the Coronavirus SARS-CoV-2 Outbreak in Germany*, you will see that some of the variables we have selected have specific values that we should either code as missing values or exclude before we do any analyses (exploratory or otherwise). 
.small[ ```r library(naniar) corona_survey <- corona_survey %>% replace_with_na_all(condition = ~.x < 0) %>% replace_with_na(replace = list(choice_of_party = c(97,98), risk_self = c(97), risk_surroundings = c(97), trust_rki = c(98), trust_government = c(98), trust_chancellor = c(98), trust_who = c(98), trust_scientists = c(98))) ``` ] --- ## Wrangling pipeline: Change variable types & recode values .small[ ```r corona_survey <- corona_survey %>% mutate(sex = recode_factor(sex, `1`= "Male", `2` = "Female"), education_cat = recode_factor(education_cat, `1` = "Low", `2` = "Medium", `3`= "High", .ordered = TRUE), age_cat = recode_factor(age_cat, `1`= "<= 25 years", `2`= "26 to 30 years", `3` = "31 to 35 years", `4` = "36 to 40 years", `5` = "41 to 45 years", `6` = "46 to 50 years", `7` = "51 to 60 years", `8` = "61 to 65 years", `9`= "66 to 70 years", `10` = ">= 71 years", .ordered = TRUE), choice_of_party = recode_factor(choice_of_party, `1`= "CDU/CSU", `2`= "SPD", `3` = "FDP", `4` = "Linke", `5` = "Gruene", `6` = "AfD", `7` = "Other")) ``` ] --- ## Wrangling pipeline: Compute new aggregate variables ```r corona_survey <- corona_survey %>% rowwise() %>% mutate(sum_measures = sum(c_across(avoid_places:wear_mask)), sum_sources = sum(c_across(info_nat_pub_br:info_other_sm)), mean_trust = mean(c_across(trust_rki:trust_scientists), na.rm = T)) %>% ungroup() ``` --- ### Saving the wrangled data As we will use this reduced and wrangled data set again in the next sessions, we should save it (as an `.rds` file). 
```r
saveRDS(corona_survey, "./data/corona_survey.rds")
```

We can then (later on) load the data set again with

```r
corona_survey <- readRDS("./data/corona_survey.rds")
```

---

## Explore your data: First look 👀

To get a first impression of the data set, we can use some of the functions we discussed in the sessions on *Data Import & Export* and *Data Wrangling Basics*, such as `dim()`, `head()`, or `str()` from `base R`, `glimpse()` from `dplyr`, or `View()` in *RStudio*.

While looking at the full data set can give us a general understanding of the data and their format and also show if (and how) we may need to wrangle them (further), it is difficult to make sense of the data just by looking at them.

---

## Summary statistics

To make sense of quantitative data we can reduce their information to unique values.

--

.center[ ~ **That's a simple definition of summary statistics** ~]

--

As such, we can use summarizing functions of

- location (e.g., the mean),

- spread (e.g., standard deviation),

- the shape of the distribution (e.g., skewness), and

- relations between variables (e.g., correlation coefficients)

---

## Summary statistics: `summary()`

A quick and easy way to check some summary statistics for your dataset is the `base R` function `summary()` which can be applied to individual variables...

```r
summary(corona_survey$left_right)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    3.00    5.00    4.66    6.00   10.00      87
```

as well as whole dataframes:

```r
summary(corona_survey[, 2:19])
```

.right[↪️]

---

class: middle

.small[
```
##      sex           age_cat       education_cat choice_of_party   left_right      risk_self     risk_surroundings  avoid_places   
##  Male  :1933   51 to 60 years:978   Low   : 423   Gruene :756   Min.   : 0.00   Min.   :1.000   Min.   :1.000     Min.   :0.0000  
##  Female:1832   61 to 65 years:386   Medium:1154   CDU/CSU:753   1st Qu.: 3.00   1st Qu.:3.000   1st Qu.:4.000     1st Qu.:1.0000  
##                >= 71 years   :382   High  :2188   SPD    :369   Median : 5.00   Median :4.000   Median :5.000     Median :1.0000  
##                46 to 50 years:367                 Linke  :287   Mean   : 4.66   Mean   :4.094   Mean   :4.552     Mean   :0.8446  
##                66 to 70 years:357                 AfD    :281   3rd Qu.: 6.00   3rd Qu.:5.000   3rd Qu.:6.000     3rd Qu.:1.0000  
##                36 to 40 years:328                 (Other):337   Max.   :10.00   Max.   :7.000   Max.   :7.000     Max.   :1.0000  
##                (Other)       :967                 NA's   :982   NA's   :87      NA's   :613     NA's   :661       NA's   :579     
##  keep_distance      wash_hands     stockup_supplies reduce_contacts    wear_mask        trust_rki     trust_government
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :1.000   Min.   :1.000   
##  1st Qu.:1.0000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:4.000   1st Qu.:3.000   
##  Median :1.0000   Median :1.0000   Median :0.0000   Median :1.0000   Median :0.0000   Median :5.000   Median :4.000   
##  Mean   :0.8029   Mean   :0.9105   Mean   :0.3214   Mean   :0.8547   Mean   :0.0367   Mean   :4.441   Mean   :3.663   
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:5.000   3rd Qu.:4.000   
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :5.000   Max.   :5.000   
##  NA's   :579      NA's   :579      NA's   :579      NA's   :579      NA's   :579      NA's   :670     NA's   :631     
##  trust_chancellor   trust_who      trust_scientists
##  Min.   :1.000    Min.   :1.000    Min.   :1.000   
##  1st Qu.:3.000    1st Qu.:4.000    1st Qu.:4.000   
##  Median :4.000    Median :4.000    Median :4.000   
##  Mean   :3.573    Mean   :3.968    Mean   :4.239   
##  3rd Qu.:4.000    3rd Qu.:5.000    3rd Qu.:5.000   
##  Max.   :5.000    Max.   :5.000    Max.   :5.000   
##  NA's   :635      NA's   :662      NA's   :658     
```
]

---

## Summary statistics: `psych::describe()`

For more detailed summary statistics for the numeric variables you can use the `describe()` function from the [`psych` package](https://cran.r-project.org/web/packages/psych/index.html).
```r
library(psych)

corona_survey %>% 
  select(where(is.numeric), -id) %>% 
  describe()
```

.right[↪️]

---

class: middle

.tinyish[
| variable | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|:------------------|---:|---------:|-------:|------:|-------:|--------:|------:|----:|----:|------:|---------:|---------:|--------:|
| left_right | 1 | 3.68e+03 | 4.66 | 1.86 | 5 | 4.68 | 1.48 | 0 | 10 | 10 | -0.103 | -0.163 | 0.0307 |
| risk_self | 2 | 3.15e+03 | 4.09 | 1.27 | 4 | 4.11 | 1.48 | 1 | 7 | 6 | -0.0457 | -0.12 | 0.0226 |
| risk_surroundings | 3 | 3.1e+03 | 4.55 | 1.4 | 5 | 4.59 | 1.48 | 1 | 7 | 6 | -0.29 | -0.214 | 0.0252 |
| avoid_places | 4 | 3.19e+03 | 0.845 | 0.362 | 1 | 0.931 | 0 | 0 | 1 | 1 | -1.9 | 1.62 | 0.00642 |
| keep_distance | 5 | 3.19e+03 | 0.803 | 0.398 | 1 | 0.878 | 0 | 0 | 1 | 1 | -1.52 | 0.317 | 0.00705 |
| wash_hands | 6 | 3.19e+03 | 0.911 | 0.285 | 1 | 1 | 0 | 0 | 1 | 1 | -2.88 | 6.27 | 0.00506 |
| stockup_supplies | 7 | 3.19e+03 | 0.321 | 0.467 | 0 | 0.277 | 0 | 0 | 1 | 1 | 0.764 | -1.42 | 0.00828 |
| reduce_contacts | 8 | 3.19e+03 | 0.855 | 0.352 | 1 | 0.943 | 0 | 0 | 1 | 1 | -2.01 | 2.05 | 0.00624 |
| wear_mask | 9 | 3.19e+03 | 0.0367 | 0.188 | 0 | 0 | 0 | 0 | 1 | 1 | 4.92 | 22.3 | 0.00333 |
| trust_rki | 10 | 3.1e+03 | 4.44 | 0.774 | 5 | 4.59 | 0 | 1 | 5 | 4 | -1.68 | 3.54 | 0.0139 |
| trust_government | 11 | 3.13e+03 | 3.66 | 1.01 | 4 | 3.75 | 1.48 | 1 | 5 | 4 | -0.789 | 0.24 | 0.0181 |
| trust_chancellor | 12 | 3.13e+03 | 3.57 | 1.15 | 4 | 3.68 | 1.48 | 1 | 5 | 4 | -0.718 | -0.221 | 0.0205 |
| trust_who | 13 | 3.1e+03 | 3.97 | 0.946 | 4 | 4.09 | 1.48 | 1 | 5 | 4 | -1.02 | 1.01 | 0.017 |
| trust_scientists | 14 | 3.11e+03 | 4.24 | 0.791 | 4 | 4.35 | 1.48 | 1 | 5 | 4 | -1.15 | 1.83 | 0.0142 |
| info_nat_pub_br | 15 | 3.17e+03 | 0.9 | 0.301 | 1 | 0.999 | 0 | 0 | 1 | 1 | -2.66 | 5.07 | 0.00534 |
| info_nat_np | 16 | 3.17e+03 | 0.349 | 0.477 | 0 | 0.311 | 0 | 0 | 1 | 1 | 0.635 | -1.6 | 0.00847 |
| info_loc_np | 17 | 3.17e+03 | 0.5 | 0.5 | 0 | 0.5 | 0 | 0 | 1 | 1 | 0.000631 | -2 | 0.00888 |
| info_fb | 18 | 3.17e+03 | 0.186 | 0.389 | 0 | 0.108 | 0 | 0 | 1 | 1 | 1.61 | 0.598 | 0.00692 |
| info_other_sm | 19 | 3.17e+03 | 0.151 | 0.359 | 0 | 0.0646 | 0 | 0 | 1 | 1 | 1.94 | 1.78 | 0.00637 |
| sum_measures | 20 | 3.19e+03 | 3.77 | 1.16 | 4 | 3.92 | 1.48 | 0 | 6 | 6 | -1.14 | 1.43 | 0.0205 |
| sum_sources | 21 | 3.17e+03 | 2.09 | 0.954 | 2 | 2.04 | 1.48 | 0 | 5 | 5 | 0.409 | 0.354 | 0.0169 |
| mean_trust | 22 | 3.16e+03 | 3.98 | 0.75 | 4 | 4.04 | 0.593 | 1 | 5 | 4 | -0.944 | 1.01 | 0.0134 |
]

---

## Summary statistics: `summarytools::descr()`

The [`summarytools` package](https://github.com/dcomtois/summarytools) provides a lot of functionalities for EDA, including the `descr()` function for summary statistics, which also provides quite a few options for customizing the output.

```r
library(summarytools)

corona_survey %>% 
  select(left_right,
         starts_with("trust"),
         sum_measures,
         sum_sources,
         mean_trust) %>% 
  descr(stats = "common")
```

.right[↪️]

---

class: middle

.small[
```
## Descriptive Statistics  
## corona_survey  
## N: 3765  
## 
##                 left_right   mean_trust   sum_measures   sum_sources   trust_chancellor
## ------------- ------------ ------------ -------------- ------------- ------------------
##          Mean         4.66         3.98           3.77          2.09               3.57
##       Std.Dev         1.86         0.75           1.16          0.95               1.15
##           Min         0.00         1.00           0.00          0.00               1.00
##        Median         5.00         4.00           4.00          2.00               4.00
##           Max        10.00         5.00           6.00          5.00               5.00
##       N.Valid      3678.00      3157.00        3186.00       3169.00            3130.00
##     Pct.Valid        97.69        83.85          84.62         84.17              83.13
## 
## Table: Table continues below
## 
##  
## 
##                 trust_government   trust_rki   trust_scientists   trust_who
## ------------- ------------------ ----------- ------------------ -----------
##          Mean               3.66        4.44               4.24        3.97
##       Std.Dev               1.01        0.77               0.79        0.95
##           Min               1.00        1.00               1.00        1.00
##        Median               4.00        5.00               4.00        4.00
##           Max               5.00        5.00               5.00        5.00
##       N.Valid            3134.00     3095.00            3107.00     3103.00
##     Pct.Valid              83.24       82.20              82.52       82.42
```
]

---

## Summary statistics with the `datawizard` package 🧙

The [`datawizard` package](https://easystats.github.io/datawizard/) which we introduced as an interesting alternative or addition for data wrangling before also provides a function for summary statistics. Similar to `descr()` from `summarytools`, this function also offers several options for customizing the output.

```r
library(datawizard)

corona_survey %>% 
  select(left_right,
         starts_with("trust"),
         sum_measures,
         sum_sources,
         mean_trust) %>% 
  describe_distribution(quartiles = TRUE)
```

.right[↪️]

---

class: middle

.small[
| Variable | Mean | SD | IQR | Min | Max | Q1 | Q3 | Skewness | Kurtosis | n | n_Missing |
|:-----------------|-----:|------:|----:|----:|----:|----:|----:|---------:|---------:|-----:|----------:|
| left_right | 4.66 | 1.86 | 3 | 0 | 10 | 3 | 6 | -0.103 | -0.16 | 3678 | 87 |
| trust_rki | 4.44 | 0.774 | 1 | 1 | 5 | 4 | 5 | -1.69 | 3.55 | 3095 | 670 |
| trust_government | 3.66 | 1.01 | 1 | 1 | 5 | 3 | 4 | -0.79 | 0.244 | 3134 | 631 |
| trust_chancellor | 3.57 | 1.15 | 1 | 1 | 5 | 3 | 4 | -0.719 | -0.218 | 3130 | 635 |
| trust_who | 3.97 | 0.946 | 1 | 1 | 5 | 4 | 5 | -1.02 | 1.02 | 3103 | 662 |
| trust_scientists | 4.24 | 0.791 | 1 | 1 | 5 | 4 | 5 | -1.15 | 1.84 | 3107 | 658 |
| sum_measures | 3.77 | 1.16 | 2 | 0 | 6 | 3 | 5 | -1.14 | 1.44 | 3186 | 579 |
| sum_sources | 2.09 | 0.954 | 2 | 0 | 5 | 1 | 3 | 0.409 | 0.359 | 3169 | 596 |
| mean_trust | 3.98 | 0.75 | 1 | 1 | 5 | 3.6 | 4.6 | -0.945 | 1.02 | 3157 | 608 |
]

---

## Summary statistics with `dplyr`

`dplyr` provides a helpful function for creating summary statistics: `summarize()`

`summarize()` is a vectorized function that can be used to create summary statistics for variables using functions like...

- `mean()`

- `sd()`

- `min()`

- `max()`

- etc.

---

## Summary statistics with `dplyr`

While creating summary statistics using `summarize()` from `dplyr` requires writing more code, it is the most flexible option.

Another nice benefit of `dplyr::summarize()` is that it produces a `tibble` which can be used for further analyses or for creating plots or tables.

---

## `dplyr::summarize()`

.small[
```r
corona_survey %>% 
  summarize(
    mean_trust_gov = mean(trust_government, na.rm = TRUE),
    sd_trust_gov = sd(trust_government, na.rm = TRUE),
    var_trust_gov = var(trust_government, na.rm = TRUE),
    min_trust_gov = min(trust_government, na.rm = TRUE),
    max_trust_gov = max(trust_government, na.rm = TRUE)
  )
```
| mean_trust_gov | sd_trust_gov | var_trust_gov | min_trust_gov | max_trust_gov |
|---------------:|-------------:|--------------:|--------------:|--------------:|
| 3.66 | 1.01 | 1.03 | 1 | 5 |
] --- ## `dplyr::group_by()` The `dplyr` function `group_by()` creates dataframes (tibbles) that are grouped by one or more variables. This can, e.g., be used to produce grouped summary statistics. As we have seen for the `rowwise()` function in the session on advanced data wrangling, we can end/undo the grouping via the `ungroup()` function. ```r corona_survey %>% filter(!is.na(choice_of_party)) %>% group_by(choice_of_party) %>% summarize( mean_trust_gov = mean(trust_government, na.rm = TRUE), sd_trust_gov = sd(trust_government, na.rm = TRUE), var_trust_gov = var(trust_government, na.rm = TRUE), min_trust_gov = min(trust_government, na.rm = TRUE), max_trust_gov = max(trust_government, na.rm = TRUE) ) %>% ungroup() ``` .right[↪️] --- class: middle
| choice_of_party | mean_trust_gov | sd_trust_gov | var_trust_gov | min_trust_gov | max_trust_gov |
|:--------|------:|------:|------:|---:|---:|
| CDU/CSU | 4.03 | 0.838 | 0.702 | 1 | 5 |
| SPD | 3.89 | 0.91 | 0.829 | 1 | 5 |
| FDP | 3.66 | 1.06 | 1.11 | 1 | 5 |
| Linke | 3.37 | 1.06 | 1.13 | 1 | 5 |
| Gruene | 3.87 | 0.804 | 0.647 | 1 | 5 |
| AfD | 2.83 | 1.21 | 1.46 | 1 | 5 |
| Other | 3.14 | 1.07 | 1.14 | 1 | 5 |
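---

## `dplyr::group_by()`: a toy example

The grouped pipeline above can be hard to follow with the full survey data, so here is a minimal, self-contained sketch of the same `group_by()` + `summarize()` + `ungroup()` pattern. The `party` and `trust` columns are made up for illustration:

```r
library(dplyr)

# Hypothetical toy data standing in for the survey
toy <- tibble(
  party = c("A", "A", "B", "B", "B"),
  trust = c(4, 5, 2, 3, NA)
)

toy %>%
  group_by(party) %>%
  summarize(
    n          = n(),                        # group size
    mean_trust = mean(trust, na.rm = TRUE)   # per-group mean
  ) %>%
  arrange(desc(mean_trust)) %>%              # sort groups by their mean
  ungroup()
# party A: n = 2, mean_trust = 4.5; party B: n = 3, mean_trust = 2.5
```

`arrange()` is optional here; it simply orders the grouped summary, which can be handy when inspecting many groups.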
--- ## `dplyr::across()` To produce grouped summary statistics for multiple variables we can use the `dplyr` function `across()` which we already used in the session on advanced data wrangling. *Note*: We only use cases without missing data for any of the variables here (= listwise deletion). ```r corona_survey %>% select(choice_of_party, starts_with("trust")) %>% drop_na() %>% group_by(choice_of_party) %>% summarize(across(starts_with("trust"), list(mean = mean, sd = sd), .names = "{col}_{fn}")) ``` .right[↪️] --- class: middle
| choice_of_party | trust_rki_mean | trust_rki_sd | trust_government_mean | trust_government_sd | trust_chancellor_mean | trust_chancellor_sd | trust_who_mean | trust_who_sd | trust_scientists_mean | trust_scientists_sd |
|:--------|-----:|------:|-----:|------:|-----:|------:|-----:|------:|-----:|------:|
| CDU/CSU | 4.57 | 0.648 | 4.02 | 0.837 | 4.04 | 0.938 | 4.1 | 0.825 | 4.28 | 0.769 |
| SPD | 4.57 | 0.647 | 3.87 | 0.91 | 3.81 | 1.06 | 4.11 | 0.859 | 4.36 | 0.768 |
| FDP | 4.46 | 0.705 | 3.62 | 1.05 | 3.38 | 1.22 | 3.92 | 0.999 | 4.28 | 0.81 |
| Linke | 4.36 | 0.858 | 3.35 | 1.05 | 3.3 | 1.13 | 3.91 | 0.927 | 4.19 | 0.791 |
| Gruene | 4.52 | 0.696 | 3.86 | 0.802 | 3.84 | 0.89 | 4.15 | 0.78 | 4.42 | 0.667 |
| AfD | 4.1 | 0.923 | 2.82 | 1.19 | 2.42 | 1.24 | 3.24 | 1.21 | 3.93 | 0.858 |
| Other | 4.24 | 0.964 | 3.16 | 1.05 | 3 | 1.08 | 3.78 | 1.05 | 4.09 | 0.876 |
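---

## `dplyr::across()` without listwise deletion

Instead of dropping incomplete cases with `drop_na()`, missing values can be handled inside `across()` by passing purrr-style lambdas with `na.rm = TRUE`. A minimal sketch with made-up toy data (the column names are hypothetical):

```r
library(dplyr)

# Hypothetical toy data with missings in the trust items
toy <- tibble(
  group   = c("A", "A", "B"),
  trust_x = c(4, NA, 2),
  trust_y = c(5, 3, NA)
)

toy %>%
  group_by(group) %>%
  summarize(across(starts_with("trust"),
                   list(mean = ~ mean(.x, na.rm = TRUE)),  # NAs removed per variable
                   .names = "{col}_{fn}")) %>%
  ungroup()
# group A: trust_x_mean = 4, trust_y_mean = 4
```

Note that the two approaches can yield different results: per-variable `na.rm = TRUE` uses all available values for each variable, while listwise deletion restricts every summary to complete cases.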
---

## `dplyr::across()`

<img src="img/dplyr_across.png" width="95%" style="display: block; margin: auto;" />
<small><small>Artwork by [Allison Horst](https://github.com/allisonhorst/stats-illustrations) </small></small>

---

class: center, middle

# [Exercise](https://jobreu.github.io/r-intro-gesis-2021/exercises/Exercise_3_1_1_Summary_Statisticss.html) time 🏋️♀️💪🏃🚴
## [Solutions](https://jobreu.github.io/r-intro-gesis-2021/solutions/Exercise_3_1_1_Summary_Statistics.html)

---

## Frequencies: `table()`

A simple way of looking at frequencies (e.g., for categorical variables) is the `base R` function `table()`.

```r
table(corona_survey$choice_of_party)
```

```
## 
## CDU/CSU     SPD     FDP   Linke  Gruene     AfD   Other 
##     753     369     254     287     756     281      83
```

If you also want to include `NA` in the frequency counts, you need to specify the argument `useNA = "always"`.

```r
table(corona_survey$choice_of_party, useNA = "always")
```

```
## 
## CDU/CSU     SPD     FDP   Linke  Gruene     AfD   Other    <NA> 
##     753     369     254     287     756     281      83     982
```

---

## Proportions with `prop.table()`

If you want proportions instead of raw counts, you can use the `base R` function `prop.table()`.

**NB**: You need to apply this function to an output produced by `table()`.

.small[
```r
prop.table(table(corona_survey$choice_of_party))
```

```
## 
##    CDU/CSU        SPD        FDP      Linke     Gruene        AfD      Other 
## 0.27057133 0.13259073 0.09126842 0.10312612 0.27164930 0.10097018 0.02982393
```

```r
prop.table(table(corona_survey$choice_of_party, useNA = "always"))
```

```
## 
##    CDU/CSU        SPD        FDP      Linke     Gruene        AfD      Other       <NA> 
## 0.20000000 0.09800797 0.06746348 0.07622842 0.20079681 0.07463479 0.02204515 0.26082337
```
]

---

## Proportions with `prop.table()`

If you want fewer decimal places in the output, you can wrap the `prop.table()` function in a `round()` call.
```r round(prop.table(table(corona_survey$choice_of_party, useNA = "always")), 3) # rounded to 3 decimal places ``` ``` ## ## CDU/CSU SPD FDP Linke Gruene AfD Other <NA> ## 0.200 0.098 0.067 0.076 0.201 0.075 0.022 0.261 ``` Or if you want percentages... ```r round((prop.table(table(corona_survey$choice_of_party, useNA = "always")) * 100), 2) ``` ``` ## ## CDU/CSU SPD FDP Linke Gruene AfD Other <NA> ## 20.00 9.80 6.75 7.62 20.08 7.46 2.20 26.08 ``` --- ## Frequencies and proportions with `summarytools::freq()` The `summarytools` package which we used before for summary statistics also includes the `freq()` function for frequency tables. .small[ ```r library(summarytools) freq(corona_survey$choice_of_party) ``` ``` ## Frequencies ## corona_survey$choice_of_party ## Type: Factor ## ## Freq % Valid % Valid Cum. % Total % Total Cum. ## ------------- ------ --------- -------------- --------- -------------- ## CDU/CSU 753 27.06 27.06 20.00 20.00 ## SPD 369 13.26 40.32 9.80 29.80 ## FDP 254 9.13 49.44 6.75 36.55 ## Linke 287 10.31 59.76 7.62 44.17 ## Gruene 756 27.16 86.92 20.08 64.25 ## AfD 281 10.10 97.02 7.46 71.71 ## Other 83 2.98 100.00 2.20 73.92 ## <NA> 982 26.08 100.00 ## Total 3765 100.00 100.00 100.00 100.00 ``` ] --- ## Frequencies and proportions with `janitor::tabyl()` The [`janitor` package](https://github.com/sfirke/janitor) that we briefly mentioned in the session on *Data Wrangling Basics* also provides the `tabyl()` function for creating frequency and proportion tables: .small[ ```r library(janitor) party_stats <- corona_survey %>% tabyl(choice_of_party) %>% adorn_pct_formatting(digits = 2, affix_sign = TRUE) party_stats ```
| choice_of_party | n | percent | valid_percent |
|:--------|----:|-------:|-------:|
| CDU/CSU | 753 | 20.00% | 27.06% |
| SPD | 369 | 9.80% | 13.26% |
| FDP | 254 | 6.75% | 9.13% |
| Linke | 287 | 7.62% | 10.31% |
| Gruene | 756 | 20.08% | 27.16% |
| AfD | 281 | 7.46% | 10.10% |
| Other | 83 | 2.20% | 2.98% |
| `<NA>` | 982 | 26.08% | - |
]

---

## Frequencies and proportions with `janitor::tabyl()`

A nice thing about `tabyl()` is that it produces a dataframe which we can, e.g., use for plotting or creating tables (which we will discuss later on).

```r
class(party_stats)
```

```
## [1] "tabyl"      "data.frame"
```

---

## Frequencies and proportions with `dplyr`

We can also use `group_by()` and `summarize()` to get frequencies and proportions for variables in our dataset.

```r
corona_survey %>% 
  filter(!is.na(choice_of_party)) %>% 
  group_by(choice_of_party) %>% 
  summarize(n = n()) %>% 
  mutate(proportion = n/sum(n)) %>% 
  ungroup()
```
| choice_of_party | n | proportion |
|:--------|----:|-------:|
| CDU/CSU | 753 | 0.271 |
| SPD | 369 | 0.133 |
| FDP | 254 | 0.0913 |
| Linke | 287 | 0.103 |
| Gruene | 756 | 0.272 |
| AfD | 281 | 0.101 |
| Other | 83 | 0.0298 |
---

## Frequencies and proportions with `dplyr`

Instead of using `group_by()` and `summarize()` to get frequency counts, we can also use `count()` from `dplyr` as a shorthand.

```r
corona_survey %>% 
  filter(!is.na(choice_of_party)) %>% 
  count(choice_of_party) %>% 
  mutate(proportion = n/sum(n)) %>% 
  ungroup()
```
| choice_of_party | n | proportion |
|:--------|----:|-------:|
| CDU/CSU | 753 | 0.271 |
| SPD | 369 | 0.133 |
| FDP | 254 | 0.0913 |
| Linke | 287 | 0.103 |
| Gruene | 756 | 0.272 |
| AfD | 281 | 0.101 |
| Other | 83 | 0.0298 |
---

## Tables in `R`

There are typically two types of outputs you can produce with `R` for further use in reports or publications: tables and plots. We will cover the creation of plots in the following session(s) on data visualization. However, as summary statistics, frequencies, and proportions are often presented in tables, we will briefly discuss how to create tables as output in `R` in the following.

As with almost everything in `R`, there are many options (read: packages) for creating tables. We will show examples from three popular options:

- [`stargazer`](https://cran.r-project.org/web/packages/stargazer/index.html)

- [`gtsummary`](http://www.danieldsjoberg.com/gtsummary/) + [`flextable`](https://davidgohel.github.io/flextable/index.html)

---

## Summary tables with `stargazer`

While there is an ever-growing list of `R` packages for creating tables with many new (promising) contenders (such as [`flextable`](https://davidgohel.github.io/flextable/index.html) or [`gt`](https://gt.rstudio.com/index.html)), `stargazer` is an established and widely-used tool for creating ASCII (plain text), `LaTeX`, and `HTML` table output.

---

## Summary tables with `stargazer`

If we, e.g., want to create a summary statistics table as text output (e.g., for printing in the `R` console), we can use `stargazer` for that.

**NB**: As the main `stargazer()` function does not work with tibbles, we need to convert our data to a regular dataframe.

```r
corona_survey %>% 
  select(starts_with("trust")) %>% 
  as.data.frame() %>% 
  stargazer(type = "text",
            digits = 2,
            title = "Descriptive statistics")
```

```
## 
## Descriptive statistics
## ================================================================
## Statistic          N    Mean  St. Dev.  Min  Pctl(25) Pctl(75)  Max 
## ----------------------------------------------------------------
## trust_rki        3,095  4.44    0.77   1.00    4.00     5.00    5.00
## trust_government 3,134  3.66    1.01   1.00    3.00     4.00    5.00
## trust_chancellor 3,130  3.57    1.15   1.00    3.00     4.00    5.00
## trust_who        3,103  3.97    0.95   1.00    4.00     5.00    5.00
## trust_scientists 3,107  4.24    0.79   1.00    4.00     5.00    5.00
## ----------------------------------------------------------------
```

---

## Summary tables with `stargazer`

`stargazer` also supports `HTML` and `LaTeX` as output formats. We can also export the output for further use (e.g., with our `LaTeX` editor of choice).

```r
# We create a directory for output files first
dir.create("./output")

corona_survey %>% 
  select(starts_with("trust")) %>% 
  as.data.frame() %>% 
  stargazer(type = "latex",
            digits = 2,
            out = "./output/stargazer_summary.tex",
            title = "Descriptive statistics")
```

*Note*: If you look at the help file for the `stargazer()` function (via `?stargazer`), you will see that it provides a lot of customization options.

---

## Summary tables with `gtsummary`

The `tbl_summary()` function from the [`gtsummary` package](http://www.danieldsjoberg.com/gtsummary/) can, e.g., be used to produce frequency table output. If you run the following code in *RStudio*, the table will be displayed in the `Viewer` pane and can then be exported from there as an image or an `HTML` file.

```r
corona_survey %>% 
  select(sex, age_cat, education_cat) %>% 
  tbl_summary()
```

*Note*: As with `stargazer`, `tbl_summary()` also provides various customization options and you can save its output in different formats, including `.html`, `.rtf`, and `.tex`.

---

## Summary tables with `gtsummary`

You can also save the output of `tbl_summary()` directly in a *Word* file. For that to work, however, you also need to use a function from the [`flextable` package](https://davidgohel.github.io/flextable/index.html).
```r library(flextable) corona_survey %>% select(sex, age_cat, education_cat) %>% tbl_summary(label = list(sex ~ "Sex", age_cat ~ "Age category", education_cat ~ "Education category")) %>% as_flex_table() %>% save_as_docx("Sample descriptives" = ., path = "./output/gt_summary.docx") ``` *Note*: If you have not done this before, for this code to work, you need to create an `output` folder in your current working directory using `dir.create("./output")`. --- class: center, middle # [Exercise](https://jobreu.github.io/r-intro-gesis-2021/exercises/Exercise_3_1_2_Frequencies_Proportions.html) time 🏋️♀️💪🏃🚴 ## [Solutions](https://jobreu.github.io/r-intro-gesis-2021/solutions/Exercise_3_1_2_Frequencies_Proportions.html) --- ## Relationships between variables In addition to checking summary statistics for individual variables, another thing that you quite possibly also want to look at as part of your exploratory data analysis (EDA) are the relationships between (specific) variables in your data set. There are many ways to do so and the appropriate choice of methods, of course, depends on the types of variables you want to explore. In the following, we will briefly discuss some options for two methods of exploring relationships between variables: - crosstabulation (for categorical variables) - correlations (for numeric and/or binary variables) --- ## Crosstabs You can also use the `base R` functions `table()` and `prop.table()` we have used before for creating univariate frequency and proportion tables to generate crosstabs. 
```r table(corona_survey$sex, corona_survey$choice_of_party) # rows, columns ``` ``` ## ## CDU/CSU SPD FDP Linke Gruene AfD Other ## Male 413 221 160 170 315 204 52 ## Female 340 148 94 117 441 77 31 ``` ```r round(prop.table(table(corona_survey$sex, corona_survey$choice_of_party))*100, 2) ``` ``` ## ## CDU/CSU SPD FDP Linke Gruene AfD Other ## Male 14.84 7.94 5.75 6.11 11.32 7.33 1.87 ## Female 12.22 5.32 3.38 4.20 15.85 2.77 1.11 ``` --- ## Crosstabs We can also calculate row or column percentages using these `base R` functions. ```r round(prop.table(table(corona_survey$sex, corona_survey$choice_of_party), 1)*100, 2) # row percentages ``` ``` ## ## CDU/CSU SPD FDP Linke Gruene AfD Other ## Male 26.91 14.40 10.42 11.07 20.52 13.29 3.39 ## Female 27.24 11.86 7.53 9.38 35.34 6.17 2.48 ``` ```r round(prop.table(table(corona_survey$sex, corona_survey$choice_of_party), 2)*100, 2) # column percentages ``` ``` ## ## CDU/CSU SPD FDP Linke Gruene AfD Other ## Male 54.85 59.89 62.99 59.23 41.67 72.60 62.65 ## Female 45.15 40.11 37.01 40.77 58.33 27.40 37.35 ``` .small[ *Note*: If you want to generate tables based on more than two variables, the `base R` function `ftable()` is a good option for prettier printing of results. ] --- ## Crosstabs with `dplyr` We can also use functions from `dplyr` to create crosstabs including frequencies and proportions. ```r corona_survey %>% filter(!is.na(choice_of_party)) %>% count(sex, choice_of_party) %>% pivot_wider(names_from = choice_of_party, values_from = n) ```
| sex | CDU/CSU | SPD | FDP | Linke | Gruene | AfD | Other |
|:-------|----:|----:|----:|----:|----:|----:|----:|
| Male | 413 | 221 | 160 | 170 | 315 | 204 | 52 |
| Female | 340 | 148 | 94 | 117 | 441 | 77 | 31 |
.small[ *Note*: We have only briefly mentioned the function in the session on *Data Wrangling Basics* (when introducing the concept of tidy data), but - as you might guess from its name - the function `pivot_wider()`, which is part of the [`tidyr` package](https://tidyr.tidyverse.org/) changes the format of a data set from long to wide. ] --- ## Crosstabs with `dplyr` We can also use functions from `dplyr` in a similar fashion as we have done for a single variable to create crosstabs including frequencies and proportions. ```r corona_survey %>% filter(!is.na(choice_of_party)) %>% count(sex, choice_of_party) %>% mutate(proportion = n/sum(n)*100) %>% select(-n) %>% pivot_wider(names_from = choice_of_party, values_from = proportion) ```
| sex | CDU/CSU | SPD | FDP | Linke | Gruene | AfD | Other |
|:-------|-----:|-----:|-----:|-----:|-----:|-----:|-----:|
| Male | 14.8 | 7.94 | 5.75 | 6.11 | 11.3 | 7.33 | 1.87 |
| Female | 12.2 | 5.32 | 3.38 | 4.2 | 15.8 | 2.77 | 1.11 |
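---

## Crosstabs with `dplyr`: row percentages

The percentages above are cell percentages (they sum to 100 across the whole table). If you want row percentages instead (as with `prop.table(..., 1)`), one option is to group by the row variable before computing the shares. A sketch, assuming the `corona_survey` data from above:

```r
corona_survey %>% 
  filter(!is.na(choice_of_party)) %>% 
  count(sex, choice_of_party) %>% 
  group_by(sex) %>%                        # shares within each row (sex)
  mutate(percent = n/sum(n)*100) %>% 
  ungroup() %>% 
  select(-n) %>% 
  pivot_wider(names_from = choice_of_party,
              values_from = percent)
```

Grouping by `choice_of_party` instead should give column percentages, analogous to `prop.table(..., 2)`.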
--- ## Other options for crosstabulation in `R` As with most things in `R`, there are many options for creating crosstabs. Some alternatives to the ones we presented before include the `CrossTable()` and `crosstab()` functions from the [`descr` package](https://cran.r-project.org/web/packages/descr/index.html) or the `ctable()` function from the [`summarytools` package](https://github.com/dcomtois/summarytools). A very versatile option for crosstabs is the `tabyl()` function from the `janitor` package that we have introduced before. --- ## Crosstabs with the `janitor` package The `tabyl()` function from the `janitor` package provides quite a few options for crosstabs. We will only show one example here, but you can learn more in the [`tabyl` vignette](https://cran.r-project.org/web/packages/janitor/vignettes/tabyls.html). .small[ ```r library(janitor) corona_survey %>% filter(!is.na(choice_of_party)) %>% tabyl(sex, choice_of_party) %>% adorn_totals(where = c("row","col")) %>% adorn_percentages(denominator = "row") %>% adorn_pct_formatting(digits = 2) %>% adorn_ns(position = "front") ```
| sex | CDU/CSU | SPD | FDP | Linke | Gruene | AfD | Other | Total |
|:-------|-------------:|-------------:|-------------:|-------------:|-------------:|-------------:|-----------:|---------------:|
| Male | 413 (26.91%) | 221 (14.40%) | 160 (10.42%) | 170 (11.07%) | 315 (20.52%) | 204 (13.29%) | 52 (3.39%) | 1535 (100.00%) |
| Female | 340 (27.24%) | 148 (11.86%) | 94 (7.53%) | 117 (9.38%) | 441 (35.34%) | 77 (6.17%) | 31 (2.48%) | 1248 (100.00%) |
| Total | 753 (27.06%) | 369 (13.26%) | 254 (9.13%) | 287 (10.31%) | 756 (27.16%) | 281 (10.10%) | 83 (2.98%) | 2783 (100.00%) |
] --- ## Chi-Square Test We can use the `summary()` function from `base R` in combination with `table()` to perform a chi-square test. ```r summary(table(corona_survey$sex, corona_survey$choice_of_party)) ``` ``` ## Number of cases in table: 2783 ## Number of factors: 2 ## Test for independence of all factors: ## Chisq = 103.67, df = 6, p-value = 4.292e-20 ``` --- ## Chi-Square Test The other crosstab packages mentioned before can also be used for chi-square tests, for example the `janitor` package. ```r library(janitor) corona_survey %>% filter(!is.na(choice_of_party)) %>% tabyl(sex, choice_of_party) %>% chisq.test() ``` ``` ## ## Pearson's Chi-squared test ## ## data: . ## X-squared = 103.67, df = 6, p-value < 2.2e-16 ``` --- ## Correlations Again, as with the crosstabs examples, there are many different options for calculating and displaying correlations in `R`. In addition to the `base R` functions, we will look at two packages in this part: [`corrr`](https://corrr.tidymodels.org/) and [`correlation`](https://github.com/easystats/correlation). *Note*: While we will not cover that in this session, `corrr` and `correlation` also offer some nice options for plotting correlations. --- ## Correlations with `base R` The `base R` function `cor()` computes the correlation coefficient(s) between two or more variables. This function can be used to calculate *Pearson's r*, *Kendall's tau*, and *Spearman's rho*. We also need to specify how we want to deal with missing values (e.g., use pairwise complete observations).
For example, let's look at the correlations between the trust variables in our dataset: .small[ ```r trust <- corona_survey %>% select(starts_with("trust")) cor(trust, use = "pairwise.complete.obs", method = "pearson") ``` ``` ## trust_rki trust_government trust_chancellor trust_who trust_scientists ## trust_rki 1.0000000 0.5233628 0.4566379 0.5234133 0.4919817 ## trust_government 0.5233628 1.0000000 0.8700270 0.5686134 0.4475961 ## trust_chancellor 0.4566379 0.8700270 1.0000000 0.5322833 0.3899243 ## trust_who 0.5234133 0.5686134 0.5322833 1.0000000 0.5117994 ## trust_scientists 0.4919817 0.4475961 0.3899243 0.5117994 1.0000000 ``` ] --- ## Correlations with `base R` With `cor.test()` you can display the results of a significance test for a correlation. ```r cor.test(trust$trust_rki, trust$trust_scientists, method = "pearson") ``` ``` ## ## Pearson's product-moment correlation ## ## data: trust$trust_rki and trust$trust_scientists ## t = 31.25, df = 3058, p-value < 2.2e-16 ## alternative hypothesis: true correlation is not equal to 0 ## 95 percent confidence interval: ## 0.4646480 0.5183788 ## sample estimates: ## cor ## 0.4919817 ``` --- ## The `corrr` package The [`corrr` package](https://corrr.tidymodels.org/) is part of the [`tidymodels` suite of packages](https://www.tidymodels.org/) and provides various functions for displaying correlations. The main function is `correlate()`, which produces a `tibble` as output. .small[ ```r library(corrr) correlate(trust) ```
| term             | trust_rki | trust_government | trust_chancellor | trust_who | trust_scientists |
|------------------|-----------|------------------|------------------|-----------|------------------|
| trust_rki        |           | 0.523            | 0.457            | 0.523     | 0.492            |
| trust_government | 0.523     |                  | 0.87             | 0.569     | 0.448            |
| trust_chancellor | 0.457     | 0.87             |                  | 0.532     | 0.39             |
| trust_who        | 0.523     | 0.569            | 0.532            |           | 0.512            |
| trust_scientists | 0.492     | 0.448            | 0.39             | 0.512     |                  |
] --- ## The `corrr` package The `corrr` package provides several functions for tweaking/optimizing the output of the `correlate()` function. Here's one example: .small[ ```r trust %>% correlate() %>% rearrange() %>% shave() %>% fashion() ```
| term             | trust_scientists | trust_rki | trust_who | trust_government | trust_chancellor |
|------------------|------------------|-----------|-----------|------------------|------------------|
| trust_scientists |                  |           |           |                  |                  |
| trust_rki        | .49              |           |           |                  |                  |
| trust_who        | .51              | .52       |           |                  |                  |
| trust_government | .45              | .52       | .57       |                  |                  |
| trust_chancellor | .39              | .46       | .53       | .87              |                  |
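The `shave()` step above can also be mimicked in `base R` by blanking out the upper triangle of a correlation matrix. A minimal sketch with simulated data (the variables here are made up for illustration):

```r
# Mimic corrr's shave() in base R: compute a correlation matrix for
# simulated data, then blank out its diagonal and upper triangle.
set.seed(42)
sim <- data.frame(a = rnorm(100), b = rnorm(100), c = rnorm(100))

cm <- round(cor(sim), 2)
cm[upper.tri(cm, diag = TRUE)] <- NA  # keep only the lower triangle
cm
```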
] --- ## The `correlation` package The [`correlation` package](https://github.com/easystats/correlation) is part of the [`easystats` collection of packages](https://easystats.github.io/easystats/) (also called the `Easyverse`). It provides a much wider range of correlation types than the `base R` function `cor()` and the `correlate()` function from the `corrr` package (e.g., biserial and tetrachoric correlations for factors). Its core is the `correlation()` function. ```r library(correlation) correlation(trust) ``` .right[↪️] --- class: middle
| Parameter1       | Parameter2       | r     | CI   | CI_low | CI_high | t    | df_error | p         | Method              | n_Obs |
|------------------|------------------|-------|------|--------|---------|------|----------|-----------|---------------------|-------|
| trust_rki        | trust_government | 0.523 | 0.95 | 0.497  | 0.549   | 34.1 | 3076     | 3.08e-215 | Pearson correlation | 3078  |
| trust_rki        | trust_chancellor | 0.457 | 0.95 | 0.428  | 0.484   | 28.4 | 3072     | 9.56e-158 | Pearson correlation | 3074  |
| trust_rki        | trust_who        | 0.523 | 0.95 | 0.497  | 0.549   | 33.9 | 3053     | 9.41e-214 | Pearson correlation | 3055  |
| trust_rki        | trust_scientists | 0.492 | 0.95 | 0.465  | 0.518   | 31.2 | 3058     | 1.1e-185  | Pearson correlation | 3060  |
| trust_government | trust_chancellor | 0.87  | 0.95 | 0.861  | 0.878   | 98.5 | 3113     | 0         | Pearson correlation | 3115  |
| trust_government | trust_who        | 0.569 | 0.95 | 0.544  | 0.592   | 38.4 | 3083     | 7.79e-263 | Pearson correlation | 3085  |
| trust_government | trust_scientists | 0.448 | 0.95 | 0.419  | 0.475   | 27.8 | 3089     | 6.95e-152 | Pearson correlation | 3091  |
| trust_chancellor | trust_who        | 0.532 | 0.95 | 0.506  | 0.557   | 34.9 | 3079     | 4e-224    | Pearson correlation | 3081  |
| trust_chancellor | trust_scientists | 0.39  | 0.95 | 0.36   | 0.419   | 23.5 | 3084     | 1.32e-112 | Pearson correlation | 3086  |
| trust_who        | trust_scientists | 0.512 | 0.95 | 0.485  | 0.537   | 33   | 3066     | 8.59e-204 | Pearson correlation | 3068  |
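The columns in this output (r, the confidence interval bounds, t, and p) correspond to the quantities that `base R`'s `cor.test()` reports for a single pair of variables. A minimal sketch with simulated data:

```r
# The per-pair quantities reported by correlation() (r, CI, t, p) are the
# same ones cor.test() returns for a single pair; simulated data here.
set.seed(1)
x <- rnorm(200)
y <- 0.5 * x + rnorm(200)

res <- cor.test(x, y, method = "pearson")
res$estimate   # r
res$conf.int   # CI_low and CI_high
res$statistic  # t
res$p.value    # p
```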
--- ## The `correlation` package Among other things, the `correlation` package allows you to calculate grouped/stratified correlations. ```r corona_survey %>% select(sex, starts_with("trust")) %>% group_by(sex) %>% correlation() ``` .right[↪️] --- class: center, middle .small[
| Group  | Parameter1       | Parameter2       | r     | CI   | CI_low | CI_high | t    | df_error | p         | Method              | n_Obs |
|--------|------------------|------------------|-------|------|--------|---------|------|----------|-----------|---------------------|-------|
| Male   | trust_rki        | trust_government | 0.517 | 0.95 | 0.48   | 0.552   | 24   | 1577     | 8.93e-108 | Pearson correlation | 1579  |
| Male   | trust_rki        | trust_chancellor | 0.448 | 0.95 | 0.408  | 0.486   | 19.9 | 1575     | 3.56e-78  | Pearson correlation | 1577  |
| Male   | trust_rki        | trust_who        | 0.493 | 0.95 | 0.454  | 0.529   | 22.4 | 1566     | 6.76e-96  | Pearson correlation | 1568  |
| Male   | trust_rki        | trust_scientists | 0.456 | 0.95 | 0.415  | 0.494   | 20.3 | 1566     | 1.34e-80  | Pearson correlation | 1568  |
| Male   | trust_government | trust_chancellor | 0.862 | 0.95 | 0.848  | 0.874   | 67.8 | 1596     | 0         | Pearson correlation | 1598  |
| Male   | trust_government | trust_who        | 0.562 | 0.95 | 0.528  | 0.595   | 27.1 | 1584     | 6.94e-132 | Pearson correlation | 1586  |
| Male   | trust_government | trust_scientists | 0.41  | 0.95 | 0.368  | 0.45    | 17.9 | 1583     | 7.11e-65  | Pearson correlation | 1585  |
| Male   | trust_chancellor | trust_who        | 0.54  | 0.95 | 0.504  | 0.574   | 25.5 | 1582     | 1.46e-119 | Pearson correlation | 1584  |
| Male   | trust_chancellor | trust_scientists | 0.358 | 0.95 | 0.315  | 0.401   | 15.3 | 1580     | 3.99e-49  | Pearson correlation | 1582  |
| Male   | trust_who        | trust_scientists | 0.493 | 0.95 | 0.455  | 0.53    | 22.5 | 1572     | 1.53e-96  | Pearson correlation | 1574  |
| Female | trust_rki        | trust_government | 0.532 | 0.95 | 0.494  | 0.567   | 24.3 | 1497     | 1.86e-109 | Pearson correlation | 1499  |
| Female | trust_rki        | trust_chancellor | 0.47  | 0.95 | 0.429  | 0.508   | 20.6 | 1495     | 1.22e-82  | Pearson correlation | 1497  |
| Female | trust_rki        | trust_who        | 0.57  | 0.95 | 0.535  | 0.604   | 26.8 | 1485     | 4.23e-128 | Pearson correlation | 1487  |
| Female | trust_rki        | trust_scientists | 0.535 | 0.95 | 0.498  | 0.57    | 24.4 | 1490     | 1.69e-110 | Pearson correlation | 1492  |
| Female | trust_government | trust_chancellor | 0.881 | 0.95 | 0.869  | 0.891   | 72.3 | 1515     | 0         | Pearson correlation | 1517  |
| Female | trust_government | trust_who        | 0.574 | 0.95 | 0.54   | 0.607   | 27.2 | 1497     | 2.14e-131 | Pearson correlation | 1499  |
| Female | trust_government | trust_scientists | 0.496 | 0.95 | 0.456  | 0.533   | 22.1 | 1504     | 1.2e-93   | Pearson correlation | 1506  |
| Female | trust_chancellor | trust_who        | 0.512 | 0.95 | 0.474  | 0.549   | 23.1 | 1495     | 2.66e-100 | Pearson correlation | 1497  |
| Female | trust_chancellor | trust_scientists | 0.434 | 0.95 | 0.392  | 0.474   | 18.7 | 1502     | 4.85e-70  | Pearson correlation | 1504  |
| Female | trust_who        | trust_scientists | 0.549 | 0.95 | 0.513  | 0.584   | 25.4 | 1492     | 1.41e-117 | Pearson correlation | 1494  |
] --- class: center, middle # [Exercise](https://jobreu.github.io/r-intro-gesis-2021/exercises/Exercise_3_1_3_Crosstabs_Correlations.html) time 🏋️♀️💪🏃🚴 ## [Solutions](https://jobreu.github.io/r-intro-gesis-2021/solutions/Exercise_3_1_3_Crosstabs_Correlations.html) --- ## Guilty by ~~association~~ correlation While correlation coefficients are useful for exploring relationships between variables, they can also be misleading. For example, if we run a correlation analysis and encounter a (Pearson's) correlation coefficient close to 0, we often think of relationships as pictured below. <img src="data:image/png;base64,#3_1_Exploratory_Data_Analysis_files/figure-html/dino-plot-1-1.png" width="50%" style="display: block; margin: auto;" /> --- ## Guilty by ~~association~~ correlation This dataset has **the same correlation coefficient (Pearson's r of -0.06)** as the one on the previous slide: <img src="data:image/png;base64,#3_1_Exploratory_Data_Analysis_files/figure-html/dino-plot-2-1.png" width="50%" style="display: block; margin: auto;" /> --- ## Guilty by ~~association~~ correlation 🦖 So does this one... <img src="data:image/png;base64,#3_1_Exploratory_Data_Analysis_files/figure-html/dino-plot-3-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Guilty by ~~association~~ correlation We could go on... The previous three examples all come from the [`datasauRus` package](https://github.com/lockedata/datasauRus), which essentially is an extension of [Anscombe's quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet) and includes 13 data sets with the same (Pearson) correlation between x and y. <img src="data:image/png;base64,#3_1_Exploratory_Data_Analysis_files/figure-html/dino-plot-4-1.png" width="50%" style="display: block; margin: auto;" /> --- ## Trust no singular value! Importantly, the x- and y-variables in the `datasaurus_dozen` data sets also all have the same means and standard deviations.
```r library(datasauRus) datasaurus_dozen %>% group_by(dataset) %>% summarize( mean_x = mean(x), mean_y = mean(y), sd_x = sd(x), sd_y = sd(y), corr = cor(x, y, method = "pearson") ) ``` .right[↪️] --- class: middle .small[
| dataset    | mean_x | mean_y | sd_x | sd_y | corr    |
|------------|--------|--------|------|------|---------|
| away       | 54.3   | 47.8   | 16.8 | 26.9 | -0.0641 |
| bullseye   | 54.3   | 47.8   | 16.8 | 26.9 | -0.0686 |
| circle     | 54.3   | 47.8   | 16.8 | 26.9 | -0.0683 |
| dino       | 54.3   | 47.8   | 16.8 | 26.9 | -0.0645 |
| dots       | 54.3   | 47.8   | 16.8 | 26.9 | -0.0603 |
| h_lines    | 54.3   | 47.8   | 16.8 | 26.9 | -0.0617 |
| high_lines | 54.3   | 47.8   | 16.8 | 26.9 | -0.0685 |
| slant_down | 54.3   | 47.8   | 16.8 | 26.9 | -0.069  |
| slant_up   | 54.3   | 47.8   | 16.8 | 26.9 | -0.0686 |
| star       | 54.3   | 47.8   | 16.8 | 26.9 | -0.063  |
| v_lines    | 54.3   | 47.8   | 16.8 | 26.9 | -0.0694 |
| wide_lines | 54.3   | 47.8   | 16.8 | 26.9 | -0.0666 |
| x_shape    | 54.3   | 47.8   | 16.8 | 26.9 | -0.0656 |
] --- ## Plot your data! The message from the `datasaurus_dozen` examples should be clear. Relying only on singular values that summarize the location or spread of a single variable or the association of two variables is not a good idea. To avoid reducing a ~~mountain to a molehill~~ dinosaur to a lack of correlation, it is important to plot your data as part of your EDA to explore: - univariate distributions - grouped univariate distributions (if you have groups you want to compare) - bivariate distributions For that reason (and also because it is fun and `R` has a lot to offer in that regard), we will dive right into data visualization with `R` in the next session... --- ## Missings & outliers Two things that can influence summary statistics as well as univariate and bivariate distributions are missings and outliers. Hence, before we can analyze our data, we should check whether we have clear patterns of missingness or extreme outliers in our data. --- ## Missing data In the *Data Wrangling Basics* session we have already discussed how we can define specific values as missings (`NA`) and how we can recode `NA` into something else. As we have seen in that session, the *GESIS Panel Special Survey on the Coronavirus SARS-CoV-2 Outbreak in Germany* dataset contains quite a few codes for different types of missing data. However, when we collect data ourselves or if we (re-)use datasets that are not as well-documented as the *GESIS Panel* data, we may need to explore potential patterns of missingness to see if there are identifiable reasons why data for certain variables and/or observations are missing.
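A quick first check of per-variable missingness is also possible in `base R` before reaching for a dedicated package. A minimal sketch with a made-up data frame:

```r
# Quick base R check of per-variable missingness; the toy data frame
# below is made up for illustration.
toy <- data.frame(
  trust_a = c(1, NA, 3, NA, 5),
  trust_b = c(NA, 2, 3, 4, 5)
)

colSums(is.na(toy))                  # number of missing values per variable
round(colMeans(is.na(toy)) * 100, 1) # percentage of missing values per variable
```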
--- ## Summarizing missingness The [`naniar` package](http://naniar.njtierney.com/index.html) that we have used for recoding values as `NA` in the *Data Wrangling Basics* session also provides a [collection of functions for summarizing missingness](http://naniar.njtierney.com/reference/miss_var_summary.html) in variables or whole data sets, of which we will show three examples in the following. ```r library(naniar) corona_survey %>% select(starts_with("trust")) %>% miss_var_summary(order = TRUE) ```
| variable         | n_miss | pct_miss |
|------------------|--------|----------|
| trust_rki        | 670    | 17.8     |
| trust_who        | 662    | 17.6     |
| trust_scientists | 658    | 17.5     |
| trust_chancellor | 635    | 16.9     |
| trust_government | 631    | 16.8     |
--- ## Summarizing missingness with `naniar` ```r corona_survey %>% pct_complete_case() ``` ``` ## [1] 58.91102 ``` --- ## Summarizing missingness with `naniar` ```r corona_survey %>% miss_case_table() ```
| n_miss_in_case | n_cases | pct_cases |
|----------------|---------|-----------|
| 0              | 2218    | 58.9      |
| 1              | 796     | 21.1      |
| 2              | 102     | 2.71      |
| 3              | 23      | 0.611     |
| 4              | 8       | 0.212     |
| 5              | 5       | 0.133     |
| 6              | 12      | 0.319     |
| 7              | 8       | 0.212     |
| 12             | 8       | 0.212     |
| 13             | 5       | 0.133     |
| 14             | 2       | 0.0531    |
| 19             | 2       | 0.0531    |
| 20             | 2       | 0.0531    |
| 21             | 350     | 9.3       |
| 22             | 187     | 4.97      |
| 23             | 37      | 0.983     |
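The per-case counts above are closely related to `base R`'s `complete.cases()`, which flags rows without any missing values. A minimal sketch with a made-up data frame:

```r
# complete.cases() flags rows without any NA; toy data for illustration.
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(1, 2, NA, 4)
)

complete.cases(toy)              # TRUE FALSE FALSE TRUE
mean(complete.cases(toy)) * 100  # percentage of complete cases: 50
```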
--- ## Outliers There are [many ways of identifying and dealing with outliers](https://ouzhang.me/2020/11/16/outliers-part4/). Here, we will only look at one simple univariate method: Using [interquartile range (IQR)](https://en.wikipedia.org/wiki/Interquartile_range) for identifying outliers. As this works better with numeric data on an [interval scale](https://en.wikipedia.org/wiki/Level_of_measurement#Interval_scale), we will use data from [*Gapminder*](https://www.gapminder.org/) on GDP per capita in 2018 for identifying outliers. ```r gdp_pc_2018 <- read_csv("./data/gdppercapita_us_inflation_adjusted.csv") %>% pivot_longer(-country, names_to = "year", values_to = "gdp_percap") %>% filter(year == "2018") %>% select(-year) ``` --- ## Detecting outliers: IQR For identifying univariate outliers we can specify lower and upper cutoffs, e.g., using the formula 25th percentile - 1.5 * interquartile range (IQR) for the lower and 75th percentile + 1.5 * IQR for the upper limit. In `R` this can be done as follows: .small[ ```r q2575 <- quantile(gdp_pc_2018$gdp_percap, probs=c(.25, .75), na.rm = TRUE) iqr <- IQR(gdp_pc_2018$gdp_percap, na.rm = TRUE) ul <- q2575[2]+1.5*iqr ll <- q2575[1]-1.5*iqr gdp_pc_2018_cut <- gdp_pc_2018 %>% filter(gdp_percap <= ul) nrow(gdp_pc_2018) - nrow(gdp_pc_2018_cut) ``` ``` ## [1] 31 ``` ] .small[ **NB**: The value of 1.5 x IQR is a rule of thumb. You should always check whether this makes sense for your data. In the above example with the *Gapminder* data, e.g., it most likely does not. An alternative approach is using mean + 2 x SD (or 3 x SD), but, again, this is nothing more than a rule of thumb. ] --- ## Other packages for EDA While we have covered quite a few packages (that can be used) for EDA in this session, there still are plenty of others that also provide some interesting functionalities. 
Some of these additional/alternative options are: - [`inspectdf`](https://alastairrushworth.github.io/inspectdf/) - [`skimr`](https://docs.ropensci.org/skimr/) - [`descriptr`](https://descriptr.rsquaredacademy.com/index.html) - [`DataExplorer`](https://boxuancui.github.io/DataExplorer/) - [`explore`](https://github.com/rolkra/explore) - [`dataReporter`](https://github.com/ekstroem/dataReporter) --- ## Automated EDA reports Some of the EDA packages provide functions for generating automated EDA reports, for example: - the `skim()` function from the [`skimr` package](https://docs.ropensci.org/skimr/) - the `make_report()` function from the [`dataReporter` package](https://github.com/ekstroem/dataReporter) - the `report()` function from the [`explore` package](https://github.com/rolkra/explore) The function `explore()` from the `explore` package also allows for interactive data exploration. --- # Extracurricular activities Further explore the *GESIS Panel Special Survey on the Coronavirus SARS-CoV-2 Outbreak in Germany* dataset and develop some ideas for interesting research questions that you could answer by analyzing it with `R`.